A Novel Method for Extracting Information from Web Pages with Multiple Presentation Templates
Authors
Abstract
Web information extraction is a key part of web data integration. Driven by the needs of e-commerce websites and advances in web design, web pages with multiple presentation templates have emerged. Current web information extraction systems usually assume a single presentation template, so they cannot extract data efficiently from pages with multiple presentation templates. This paper focuses on the extraction problem for web pages with multiple presentation templates. Four variants of this problem are considered, and a novel method based on path entropy, presentation regularity, and ontology knowledge is presented. Experiments indicate that the method is promising and achieves high recall and precision.
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
RoadRunner for Heterogeneous Web Pages Using Extended MinHash
The Internet presents a large amount of useful information that is usually formatted for its users, which makes it hard to extract relevant data from diverse sources. Therefore, there is a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structure...
Extracting Cyber Communities through Patterns
This paper proposes a new approach to the problem of extracting Web communities. Due to the variety of topics found on the Internet, discovering online communities becomes a really difficult task. In our method, we exploit the observation that pages across the Web refer to other pages of a community using similar text templates in their links. Therefore, by creating patterns that describe these...
A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages
The World Wide Web is a vast and rapidly growing source of information. Web pages contain a combination of unique data and template material, which is present across multiple pages to achieve high publishing productivity. Template detection has become a more attractive technique for web pages, since unknown templates degrade the performance of web applications due to the irrelevant terms ...
Journal title:
- JSW
Volume 5, Issue
Pages -
Publication date 2010